Abstract

This paper modifies Jaynes's axioms of plausible reasoning and derives
the minimum relative entropy principle, Bayes's rule, as well as maximum
likelihood from first principles. The new axioms, which I call the Optimum
Information Principle, are applicable whenever the decision maker is
given the data and the relevant background information. These axioms 
provide an answer to the question "why maximize entropy when faced 
with incomplete information?" 
1 Introduction



Bayesian inference [1] and the maximum entropy principle (MaxEnt) of Jaynes 
[9] are valid methods of inference when the decision maker is faced with incom- 
plete information. Although these methodologies are quite distinct, they often 
give similar results. A few authors have hinted at the possibility of deriving both 
methods from first principles. For instance, as the sample size increases, [23] 
showed that the distribution of a random variable conditional on empirical mo- 
ment constraints (computed by Bayes's rule) converges to the minimum relative 
entropy distribution subject to the same population moment constraints. Con- 
versely, [24] showed that Bayes's rule can be derived from a variational principle 
of information processing. 

One possibility of deriving both the maximum entropy principle and Bayes's 
rule is to axiomatize plausible reasoning, as [14, 13, 3, 12] attempted. In the 
most primitive form, Jaynes [12] suggested desiderata that should be employed 
in plausible reasoning, by which he deduced Bayes's rule. To apply Bayes's rule 
we have to start from some priors, and Jaynes advocates the use of the maximum 
entropy principle to set up priors. However, there are many situations in which 
both MaxEnt and Bayesian inference are applicable. Which method should we 
take then? And do they return the same result? In this paper I propose a 
different set of axioms of plausible reasoning, by which I derive the minimum 
relative entropy principle 1 , Bayes's rule, and maximum likelihood. 
1 To the best of my knowledge, the minimum relative entropy principle was first introduced
by Kullback [15, p. 37] under the name the principle of minimum discrimination information. 






I proceed in two steps. First, I list the desiderata of a measure of informa- 
tion gain when a decision maker updates the plausibility of a proposition upon 
receiving new information. From these desiderata I derive the functional form of 
information gain. Second, I require the decision maker to be maximally
conservative, given all the relevant information. That is, the decision maker
updates the plausibilities by minimizing the average information gain (i.e.,
sticks to his or her prior as much as possible) subject to all relevant
information, which I call the Optimum Information Principle. I show that the
Optimum Information Principle implies the well-known minimum relative entropy
principle, Bayes's rule, and also Jaynes's axioms.



2 Axioms of Plausible Reasoning 

Viewing probability as the plausibility of a proposition dates back at least to 
Keynes [14]. As Cox [3] describes it, "as if Euclid had placed the Pythagorean
theorem among the axioms of plane geometry," Keynes's axioms were not
fundamental, and have been improved by [13] and [3]. To date the most primitive
axioms of plausible reasoning seem to be those of Jaynes [12, pp. 17-19]: 

J-I. Degrees of plausibility are represented by real numbers. 

J-II. Qualitative correspondence with common sense. 

J-III. Consistency. 

(a) If a conclusion can be reasoned out in more than one way, then 
every possible way must lead to the same result. 

(b) The robot 2 always takes into account all of the evidence it has 
relevant to a question. It does not arbitrarily ignore some of the 
information, basing its conclusions only on what remains. In other 
words, the robot is completely nonideological. 

(c) The robot always represents equivalent states of knowledge by equiv- 
alent plausibility assignments. That is, if in two problems the 
robot's state of knowledge is the same (except perhaps for the label- 
ing of the propositions) , then it must assign the same plausibilities 
in both. 

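Desideratum II means the following. If we denote the plausibility of a
proposition $A$ given information $I$ by $p(A \mid I)$, then

\[
p(A \mid C') > p(A \mid C) \implies p(\neg A \mid C') < p(\neg A \mid C), \quad\text{and} \tag{2.1}
\]
\[
\left.
\begin{aligned}
p(A \mid C') &> p(A \mid C) \\
p(B \mid A \wedge C') &= p(B \mid A \wedge C)
\end{aligned}
\right\}
\implies p(A \wedge B \mid C') > p(A \wedge B \mid C). \tag{2.2}
\]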



In words, (2.1) says that if information C gets updated to C' in such a way
that the plausibility of A is increased, then the plausibility of the negation of 
A is decreased; (2.2) says that if, in addition, the plausibility of B given A is 
unchanged, then the plausibility that both A and B are true must increase. 



2 The "robot" is a machine that performs plausible reasoning according to the desiderata.
Chapter 2 of [12] shows that desiderata I-IIIb imply that plausibilities have a
probability representation and they obey Bayes's rule, and that desideratum
IIIc implies Laplace's Principle of Indifference [17] for setting up priors.

In order to derive MaxEnt and Bayes's rule, I first axiomatize the quantity 
which I call information gain and derive its functional form. The axioms, which 
are all intuitively appealing, are as follows. 

IG-1. Numerical representation: the information gain $I$ is a function of prior
plausibility $p$ and posterior plausibility $q$.

IG-2. Smoothness and monotonicity: the information gain is a smooth,
increasing function of the posterior plausibility.

IG-3. Path independence: the total information gain of updating the prior
plausibility $p$ to the posterior $q$ is independent of the path along which
it is updated. That is, if there are two paths $p \to r \to q$ and
$p \to r' \to q$, then $I(p,r) + I(r,q) = I(p,r') + I(r',q)$.

IG-4. Independence from the choice of unit: whatever unit we choose to describe 
plausibility, the information gain should have the same value. That is, 
I(tp, tq) = I(p, q) for t > 0. 

IG-5. Zero information gain for not updating: for any p, we have I(p,p) = 0. 

Proposition 1. Suppose that axioms IG-1 to IG-5 hold. Then
$I(p,q) = k \log \frac{q}{p}$, where $k > 0$ is an arbitrary constant.

Proof. Since by axiom IG-2 the information gain $I(p,q)$ is smooth in $q$, it is
partially differentiable with respect to $q$, and $I$ can be recovered by
integrating its partial derivative. Differentiating
$I(p,r) + I(r,q) = I(p,r') + I(r',q)$ with respect to $q$, we get

\[
\frac{\partial I}{\partial q}(r,q) = \frac{\partial I}{\partial q}(r',q). \tag{2.3}
\]

The left-hand side of (2.3) is a function of $(r,q)$, and the right-hand side
is a function of $(r',q)$. Since $r, r'$ are arbitrary, both sides of (2.3)
must be a function of $q$ only. Let $\frac{\partial I}{\partial q}(r,q) = g(q)$.
By integration we get $I(r,q) = F(r) + G(q)$, where $F$ is some function and
$G = \int g$. By the path independence axiom IG-3, we get

\[
\begin{aligned}
& [F(p) + G(r)] + [F(r) + G(q)] = [F(p) + G(r')] + [F(r') + G(q)] \\
\iff{} & F(r) + G(r) = F(r') + G(r').
\end{aligned} \tag{2.4}
\]

Since (2.4) holds for any $r, r'$, $F(r) + G(r)$ is constant, and it must be
zero by axiom IG-5: $F(r) + G(r) = I(r,r) = 0$. Therefore
$I(p,q) = F(p) + G(q) = G(q) - G(p)$. By axiom IG-4, we have
$G(tq) - G(tp) = G(q) - G(p)$. Differentiating both sides with respect to $q$,
we get $tG'(tq) = G'(q)$. Multiplying both sides by $q$ and letting $x = tq$,
we get $xG'(x) = qG'(q)$, so the function $xG'(x)$ is a constant $k$.
Integrating $G'(x) = k/x$ yields $G(x) = k \log x + C$, hence
$I(p,q) = G(q) - G(p) = k \log \frac{q}{p}$. Since $I$ is increasing in $q$ by
axiom IG-2, we get $k > 0$. Clearly this function satisfies all axioms IG-1 to
IG-5. □
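
As a quick numerical sanity check of Proposition 1 (an illustrative sketch, not part of the derivation), the following Python snippet evaluates the logarithmic information gain at a few arbitrary plausibilities and checks path independence (IG-3), unit independence (IG-4), and zero gain for no update (IG-5). The function name `info_gain` and the sample values are assumptions chosen only for illustration.

```python
import math

def info_gain(p, q, k=1.0):
    """Information gain I(p, q) = k * log(q / p), the form derived in Proposition 1."""
    return k * math.log(q / p)

# Arbitrary illustrative plausibilities (assumed values, not from the paper).
p, r, r_alt, q, t = 0.2, 0.5, 0.35, 0.8, 3.0

# IG-3: the paths p -> r -> q and p -> r' -> q give the same total gain.
path_a = info_gain(p, r) + info_gain(r, q)
path_b = info_gain(p, r_alt) + info_gain(r_alt, q)

# IG-4: rescaling both plausibilities by t > 0 leaves the gain unchanged.
scaled = info_gain(t * p, t * q)

# IG-5: not updating yields zero gain.
no_update = info_gain(p, p)
```

Both paths telescope to the direct gain `info_gain(p, q)`, which is the content of the identity I(p,q) = G(q) - G(p) used in the proof.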
From now on let us normalize the arbitrary constant $k$ to 1, so the
information gain is given by $I(p,q) = \log \frac{q}{p}$. This result,

\[
\text{information gain} = \log \frac{\text{posterior plausibility}}{\text{prior plausibility}},
\]

is mathematically identical to [5, p. 4], although Goldman takes this as the
definition. 3

In order to make plausible reasoning based on available information, consider 
the following desiderata. 

I. Degrees of plausibility are represented by probabilities. 

II. The robot always takes into account all of the evidence it has relevant to a 
question. It does not arbitrarily ignore some of the information, basing its 
conclusions only on what remains. In other words, the robot is completely 
nonideological.

III. Aristotelian logic: the robot assigns zero plausibility to propositions that 
contradict its knowledge. 

IV. The robot always represents equivalent states of knowledge by equivalent 
plausibility assignments. That is, if in two problems the robot's state of 
knowledge is the same (except perhaps for the labeling of the propositions), 
then it must assign the same plausibilities in both. 

V. Given prior plausibilities, the robot updates the plausibilities by minimiz- 
ing the average information gain of the posterior plausibilities subject to 
known information. In other words, the robot is maximally conservative. 

Desideratum I is stronger than Jaynes's desideratum J-I because I assume 
that the plausibility is a probability (i.e., finitely or countably additive measure). 
In particular, the plausibilities of mutually exclusive propositions are additive: 
if A, B are mutually exclusive propositions, then p(A V B) = p(A) + p(B). 
Desideratum II is identical to J-IIIb. Desideratum III might be interpreted as a 
special case of II and probably needs no justification, but I need it nevertheless. 
Desideratum IV is identical to J-IIIc, Laplace's Principle of Indifference, which 
may or may not be necessary to prove subsequent theorems. 

Desideratum V is the major difference from Jaynes's axioms. While Jaynes 
imposes "qualitative correspondence with common sense" (J-II), I impose that
the robot is maximally conservative. This axiom makes sense, for if the robot 
radically updates the plausibilities (i.e., not sticking to its prior), then it should 
not have set up the particular prior plausibilities in the first place. To avoid 
unnecessary reference to axiom numbers, let us group the desiderata as follows: 



3 In information theory the quantity $-\log p$ is known as the self-information, although I
was unable to find a reference for its origin (Tribus [21] calls it surprisal). Our information
gain $I(p,q) = \log \frac{q}{p}$ is the difference of the self-information of the prior and posterior. Kullback
and Leibler [16] call $\log \frac{p_1}{p_2}$ the information for discrimination, where $p_1, p_2$ are general
probabilities and not necessarily the prior and the posterior. The prior/posterior interpretation of
$p$ and $q$ can also be clearly seen in [7, 8].
I-III: Weak Axioms of Plausible Reasoning

I-IV: Strong Axioms of Plausible Reasoning

IG-1 to IG-5 and V: Minimum Information Gain Principle



3 Implications of the Axioms 

In this section I show that the new axioms imply Bayesian inference, maximum
likelihood, the maximum entropy principle, and the minimum relative entropy
principle.

Theorem 2. Weak plausibility and the minimum information gain principle
imply the minimum relative entropy principle (the minimum discrimination
information principle of Kullback [15, p. 37]).

Proof. Let $\{A_i\}$ be propositions that are mutually exclusive and
exhaustive. Let $p_i = p(A_i \mid I)$ be the prior plausibility of proposition
$A_i$ given background information $I$, and $q_i = p(A_i \mid I')$ be the
posterior plausibility to be computed given the new information $I'$. By
desideratum I, we have $p_i, q_i \ge 0$ and $\sum p_i = \sum q_i = 1$. Since
by Proposition 1 the information gain of $A_i$ is $\log \frac{q_i}{p_i}$, the
ex post average information gain is

\[
H(q;p) := \sum_{i=1}^n q_i \log \frac{q_i}{p_i},
\]

the relative entropy. 4 By desiderata II and V, the robot minimizes $H(q;p)$
subject to all known information $I'$ and the constraints $q_i \ge 0$,
$\sum q_i = 1$, which is precisely the minimum relative entropy principle. □
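
To make Theorem 2 concrete, here is a minimal Python sketch under a specific assumption that is not in the text above: the new information I' is taken to be a single mean constraint on a six-sided die (the target mean 4.5 is an assumed value). For a linear constraint of this kind the minimizer of the relative entropy is known to have the exponentially tilted form q_i proportional to p_i exp(lambda x_i), so the sketch only needs to find lambda by bisection. The helper names `relative_entropy`, `tilt`, and `min_relative_entropy` are hypothetical.

```python
import math

def relative_entropy(q, p):
    """H(q; p) = sum_i q_i log(q_i / p_i), with the convention 0 log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def tilt(p, x, lam):
    """Exponentially tilted distribution q_i proportional to p_i * exp(lam * x_i)."""
    w = [pi * math.exp(lam * xi) for pi, xi in zip(p, x)]
    z = sum(w)
    return [wi / z for wi in w]

def min_relative_entropy(p, x, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Bisect on lam; the tilted mean is increasing in lam, so bisection applies."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        q = tilt(p, x, mid)
        mean = sum(qi * xi for qi, xi in zip(q, x))
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    return tilt(p, x, 0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]          # assumed outcome values
prior = [1 / 6] * 6                 # uniform prior plausibilities
posterior = min_relative_entropy(prior, faces, 4.5)
```

Any other distribution that satisfies the same mean constraint has a strictly larger `relative_entropy` with respect to `prior`, which is what "maximally conservative" means in this setting.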

Corollary 3. Strong plausibility and the minimum information gain principle 
imply the maximum entropy principle of Jaynes [9] for setting up priors. 

Proof. Desideratum IV is nothing but Laplace's Principle of Indifference. Hence,
by desideratum I, the robot assigns the prior plausibility $p(A_i) = \frac{1}{n}$.
By Theorem 2 the robot computes the posterior plausibility
$p_i = p(A_i \mid I)$ by minimizing

\[
\sum_{i=1}^n p_i \log \frac{p_i}{1/n} = \sum_{i=1}^n p_i (\log p_i + \log n) = \sum_{i=1}^n p_i \log p_i + \log n
\]

(where we have invoked desideratum I: $\sum p_i = 1$) or equivalently,
maximizing Shannon's entropy $H(p) = -\sum_{i=1}^n p_i \log p_i$ [19]. This is
precisely Jaynes's maximum entropy principle [9]. □
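
The identity used in the proof of Corollary 3, namely that the relative entropy with respect to a uniform prior equals log n minus Shannon's entropy, can be checked numerically. The snippet below is a small sketch with an arbitrarily chosen distribution; the function names and the example values are assumptions for illustration only.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(q, p):
    """H(q; p) = sum_i q_i log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p = [0.5, 0.25, 0.125, 0.125]       # assumed example distribution
n = len(p)
uniform = [1 / n] * n               # Principle of Indifference prior

lhs = relative_entropy(p, uniform)          # quantity minimized in the proof
rhs = math.log(n) - shannon_entropy(p)      # log n minus Shannon's entropy
```

Since `lhs` and `rhs` agree and log n does not depend on p, minimizing the first is the same as maximizing the entropy.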

I propose to define the Optimum Information Principle as the combination
of the weak or strong plausibility and the minimum information gain principle,
even though its implication is the well-known minimum relative entropy principle.
There are two reasons to avoid the term "entropy" . First, "entropy" is a mis- 
nomer both in physics (see [2]) and in information theory. According to [22], 
Shannon [19] named his measure of uncertainty or missing information "en- 
tropy" following the advice of von Neumann: "[It] has been used in statistical 
mechanics under that name . . . [and] no one knows what entropy really is, so in 

4 This quantity was first proposed by Kullback and Leibler [16], who call it,
appropriately, "information".
a debate you will always have the advantage." Clausius coined the word
"entropy" after the Greek word for "transformation"; given that "entropy" is a
misnomer, adding the adjective "relative" only makes it worse. Second, as a
measure of information gain the Kullback-Leibler information H(q;p) is more 
fundamental than the Shannon entropy H(p) as shown by the above axiomatic 
derivation as well as the comparison of the two information measures provided 
in [8] : the Kullback-Leibler information, unlike the Shannon entropy, extends to 
arbitrary probability measures and it satisfies an additivity property. Since by 
desideratum V the quantity H(q;p) is the average information gain, and since 
Kullback and Leibler [16] called H(q;p) "information" before the term "relative
entropy" was coined, the term Optimum Information Principle seems best. 5

Theorem 4. Weak plausibility and the minimum information gain principle 
imply Jaynes's desiderata I-IIIb, in particular Bayes's rule. Therefore, the Op- 
timum Information Principle is consistent with Bayesian inference. 

Proof. Let us first prove Bayes's rule. Suppose that the robot is given
background information $I$ and that the robot has prior plausibilities on the
propositions $A_1, \dots, A_n, B$, and any logical conjunction or negation
generated by them. Therefore the prior plausibilities of $A_i \wedge A_j$,
$A_i \wedge B$, $A_i \wedge (\neg B)$, etc., which are denoted by
$p(A_i \cap A_j \mid I)$, $p(A_i \cap B \mid I)$, $p(A_i \cap B^c \mid I)$, etc.,
are well defined. The task of the robot is to update the plausibilities of
$\{A_i\}$ when it is given additional information $B$. Since there are only a
finite number of propositions, without loss of generality we may assume that
$\{A_i\}$ are mutually exclusive and exhaustive. By desideratum I, we have
$\sum_{i=1}^n p(A_i \mid I) = 1$.

Let us denote the posterior plausibilities by $q(A_i \cap B \mid B \cap I)$,
etc. In order to compute them, by Theorem 2 the robot solves

\begin{align}
& \min_q \sum q \log \frac{q}{p} \quad \text{subject to} \tag{3.1a} \\
& \sum_{i=1}^n \bigl( q(A_i \cap B \mid B \cap I) + q(A_i \cap B^c \mid B \cap I) \bigr) = 1, \tag{3.1b} \\
& (\forall i)\ q(A_i \cap B^c \mid B \cap I) = 0, \tag{3.1c}
\end{align}

where $p, q$ in (3.1a) take all possible forms of $p(A_i \cap B \mid I)$,
$q(A_i \cap B \mid B \cap I)$ and $p(A_i \cap B^c \mid I)$,
$q(A_i \cap B^c \mid B \cap I)$. Conditions (3.1b) and (3.1c) come from
desiderata I and III: since $\neg B$ (and hence $A_i \wedge (\neg B)$) is
logically impossible knowing $B$, the robot assigns zero plausibility to
$A_i \wedge (\neg B)$. That we impose (3.1b) and (3.1c) and nothing else comes
from using all relevant information as in desideratum II.

Since the function $f(q) = q \log \frac{q}{p}$ is continuous and strictly convex
and the constraints (3.1b), (3.1c) constitute a compact convex set, we can apply
the Karush-Kuhn-Tucker theorem to solve (3.1). Let $\lambda$ be the Lagrange
multiplier corresponding to (3.1b). The Lagrangian is

\[
L(q, \lambda) = \sum_{i=1}^n q_i \log \frac{q_i}{p_i} + \lambda \left( \sum_{i=1}^n q_i - 1 \right),
\]

5 [6] calls it the Maximum Information Principle, meaning that the missing information
is maximized. Maximizing the missing information is equivalent to minimizing the
information gain as we do here. The adjective "optimum" avoids the confusion between
maximum/minimum.
where we have used the shorthand $q_i = q(A_i \cap B \mid B \cap I)$ and
$p_i = p(A_i \cap B \mid I)$. The first-order condition, which is necessary and
sufficient, reads

\[
\frac{\partial L}{\partial q_i} = \log \frac{q_i}{p_i} + 1 + \lambda = 0.
\]

This shows that $q_i$ is proportional to $p_i$, so by $\sum q_i = 1$ we obtain
$q_i = p_i / \sum_{j=1}^n p_j$. Therefore,

\begin{align}
q(A_i \mid B \cap I) &= q(A_i \cap B \mid B \cap I) + q(A_i \cap B^c \mid B \cap I) \notag \\
&= q(A_i \cap B \mid B \cap I) = q_i \notag \\
&= \frac{p(A_i \cap B \mid I)}{\sum_{j=1}^n p(A_j \cap B \mid I)} = \frac{p(A_i \cap B \mid I)}{p(B \mid I)}, \tag{3.2}
\end{align}

where the first equality holds because $q$ is a probability (desideratum I), the
second equality holds because $q(A_i \cap B^c \mid B \cap I) = 0$ (desideratum
III), and the last equality holds because $p$ is a probability and $\{A_i\}$
are mutually exclusive and exhaustive. (3.2) is precisely Bayes's rule.

Now let us show that Jaynes's desiderata I-IIIb are implied. All we need
to show are desiderata II (conditions (2.1) and (2.2)) and IIIa. (2.1) holds
because plausibility has a probability representation by desideratum I. (2.2)
holds by Bayes's rule, which we have already deduced in (3.2). Desideratum
IIIa holds by the Additivity Theorem of Hobson and Cheng [8, p. 308], where
they essentially show that if the robot has initial background information $I_0$
that gets updated to $I_1$ and then to $I_2$, with plausibilities $p_0, p_1, p_2$
respectively, then

\[
H(p_2; p_0) = H(p_2; p_1) + H(p_1; p_0),
\]

that is, the (minimized) Kullback-Leibler information is additive. 6 In particular,
if there are two ways to update, $I_0 \to I_1 \to I_2$ and $I_0 \to I_1' \to I_2$,
then we obtain

\[
H(p_2; p_1) + H(p_1; p_0) = H(p_2; p_1') + H(p_1'; p_0), \tag{3.3}
\]

the path independence. Therefore Jaynes's desideratum IIIa holds because if a
conclusion can be reasoned out in more than one way, the path independence
property (3.3) ensures that every possible way leads to the same result. □
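
The first part of the proof above, in which the constrained minimization yields Bayes's rule, can be illustrated numerically. The sketch below assumes three mutually exclusive and exhaustive propositions with made-up joint prior plausibilities; the variable and function names are hypothetical. It renormalizes the prior mass on B (the solution of the first-order condition) and compares the result with Bayes's rule computed directly from p(B | A_i) and p(A_i).

```python
import math

# Assumed joint prior plausibilities p(A_i and B | I) and p(A_i and not-B | I);
# the six numbers sum to one.
prior_with_b = [0.10, 0.25, 0.05]
prior_without_b = [0.30, 0.10, 0.20]

def objective(q, p):
    """sum_i q_i log(q_i / p_i); terms with q_i = 0 contribute nothing."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def update_on_b(p_with_b):
    """Minimizer from the first-order condition: q_i proportional to p(A_i and B | I)."""
    total = sum(p_with_b)
    return [pi / total for pi in p_with_b]

posterior = update_on_b(prior_with_b)

# Bayes's rule computed directly: p(A_i | B) = p(B | A_i) p(A_i) / p(B).
p_a = [w + wo for w, wo in zip(prior_with_b, prior_without_b)]
p_b_given_a = [w / a for w, a in zip(prior_with_b, p_a)]
p_b = sum(pa * pb for pa, pb in zip(p_a, p_b_given_a))
bayes = [pb * pa / p_b for pa, pb in zip(p_a, p_b_given_a)]
```

Here `posterior` and `bayes` coincide, and any other distribution supported on B gives a larger value of `objective`, in line with the strict convexity argument in the proof.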

At this point I stress the distinction between our axiomatization and those of
other authors. In his seminal work [19], Shannon imposes as the third axiom "If a
choice be broken down into two successive choices, the original H should be the
weighted sum of the individual values of H", in which he implicitly uses Bayes's
rule. The same remark applies to the axiomatization of the Kullback-Leibler
information by Hobson [7]. Similarly, in the important axiomatization of the
maximum entropy principle, Shore and Johnson [20] implicitly use Bayes's rule
in their fourth axiom "Subset Independence: It should not matter whether one
treats an independent subset of system states in terms of a separate conditional
density or in terms of the full system density". Zellner [24] derives Bayes's
rule from an "information processing rule", but it is not clear how it relates to
maximum entropy and his definition of information seems somewhat arbitrary.

6 This property is mathematically equivalent to the "Subset Independence" axiom of Shore 
and Johnson [20, p. 27] . 
In contrast, Cox [3] and Jaynes [12] derive Bayes's rule from intuitively
appealing first principles, as we have done. In addition we have derived the 
maximum entropy principle and the minimum relative entropy principle. 

Finally, let us show that the Optimum Information Principle implies maxi- 
mum likelihood. 

Theorem 5. The Optimum Information Principle implies the maximum like- 
lihood principle of Fisher [4]. 

Proof. Suppose that $\{X_n\}_{n=1}^N$ are independently and identically
distributed random variables with an unknown density $f$. Given the realizations
$\{x_n\}$, suppose that the statistician wishes to fit a parametric density
$f(x;\theta)$ to $f$, where $\theta \in \Theta$ is a parameter. Although prior
and posterior distributions are meaningless for a frequentist, it is natural to
interpret that the model $f(x;\theta)$ and the truth $f$ correspond to the prior
and posterior, respectively. Hence to make an optimal inference the statistician
should choose $\theta$ so as to minimize the Kullback-Leibler information

\[
H(f; f(\cdot;\theta)) = \int f(x) \log \frac{f(x)}{f(x;\theta)} \, dx
= \int f \log f \, dx - \int f(x) \log f(x;\theta) \, dx
\approx \text{constant} - \frac{1}{N} \sum_{n=1}^N \log f(x_n;\theta),
\]

so the statistician should maximize the log likelihood
$\sum_{n=1}^N \log f(x_n;\theta)$. □
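
As a hedged numerical illustration of Theorem 5, the sketch below assumes an exponential model f(x; theta) = theta exp(-theta x) and a small set of made-up observations (both are assumptions, not taken from the paper). It treats the sample average of -log f(x_n; theta) as the theta-dependent part of the Kullback-Leibler information, consistent with the conclusion of the proof, and checks on a grid that minimizing it selects the same theta as maximizing the log likelihood.

```python
import math

# Hypothetical observations, assumed for illustration only.
data = [0.3, 1.2, 0.7, 2.5, 0.9, 0.4, 1.8, 0.6]

def log_likelihood(theta, xs):
    """Sum of log f(x; theta) for the exponential density theta * exp(-theta * x)."""
    return sum(math.log(theta) - theta * x for x in xs)

def empirical_kl_term(theta, xs):
    """-(1/N) sum_n log f(x_n; theta): the theta-dependent part of the KL information."""
    return -log_likelihood(theta, xs) / len(xs)

# Coarse grid over the parameter space (an assumed search range).
grid = [0.01 * k for k in range(1, 501)]

theta_min_kl = min(grid, key=lambda th: empirical_kl_term(th, data))
theta_max_ll = max(grid, key=lambda th: log_likelihood(th, data))

# Known closed-form maximum likelihood estimate of the exponential rate.
theta_closed_form = len(data) / sum(data)
```

Up to the grid resolution, both criteria land on the closed-form estimate N divided by the sum of the observations.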

4 Concluding Remarks 

The maximum entropy principle has occasionally been criticized ad hoc as "Why
maximize entropy (or minimize relative entropy), why not other functions?". An
inference method is valuable if and only if it is useful in analyzing real data, 
and therefore an inference method requires no interpretation, and no justifica- 
tion except practical usefulness. (Nevertheless the justification of the maximum 
entropy principle has been provided [9, 10, 11, 20].) It is well-known that the 
minimum relative entropy principle (maximum entropy principle) and Bayesian 
inference are useful (see [18, 12] and the references therein). Therefore, since our 
Optimum Information Principle implies the minimum relative entropy principle, 
Bayes's rule, as well as maximum likelihood, it should be equally useful. In addi- 
tion we have axiomatized plausible reasoning and derived the maximum entropy 
principle; hence we have answered the question "why maximize entropy?" 